Skip to content

NMS-19647: Fix out-of-order sample loss by making remote write synchronous#109

Open
cnewkirk wants to merge 1 commit intoOpenNMS:masterfrom
cnewkirk:bugfix/improve-prom-writespec-compliance
Open

NMS-19647: Fix out-of-order sample loss by making remote write synchronous#109
cnewkirk wants to merge 1 commit intoOpenNMS:masterfrom
cnewkirk:bugfix/improve-prom-writespec-compliance

Conversation

@cnewkirk
Copy link
Copy Markdown
Contributor

The store() method previously fired HTTP writes asynchronously via executeAsync() and returned immediately. When the RingBuffer's multiple worker threads dispatched consecutive batches containing samples for the same series, the async HTTP requests could arrive at the remote write endpoint out of timestamp order, causing the backend to reject the stale samples as out-of-order.

This change makes store() block until the HTTP write completes, ensuring the ring buffer worker thread does not process the next batch until the current write has landed. This preserves per-series timestamp ordering across consecutive WriteRequests as required by the Prometheus Remote Write spec.

Additionally fixes a bug where samplesLost incorrectly counted unfiltered samples (including NaN) instead of the actual samples that were attempted.

Validated via A/B E2E testing against Thanos Receive:

  • Baseline (async): 14 out-of-order / 5,262 appended (0.27% loss)
  • Fix (sync): 0 out-of-order / 5,264 appended (0.00% loss)
  • Throughput: identical (~5,260 samples over equal soak periods)
  • All 45 smoke tests passing on both runs

@marshallmassengill
Copy link
Copy Markdown

Created https://opennms.atlassian.net/browse/NMS-19647 for this one.

@marshallmassengill marshallmassengill changed the title Fix out-of-order sample loss by making remote write synchronous NMS-19647: Fix out-of-order sample loss by making remote write synchronous Apr 27, 2026
@cnewkirk cnewkirk force-pushed the bugfix/improve-prom-writespec-compliance branch from c7f4ba6 to ea8d098 Compare April 27, 2026 16:04
The store() method previously fired HTTP writes asynchronously via
executeAsync() and returned immediately. When the RingBuffer's
multiple worker threads dispatched consecutive batches containing
samples for the same series, the async HTTP requests could arrive
at the remote write endpoint out of timestamp order, causing the
backend to reject the stale samples as out-of-order.

This change makes store() block until the HTTP write completes,
ensuring the ring buffer worker thread does not process the next
batch until the current write has landed. This preserves per-series
timestamp ordering across consecutive WriteRequests as required by
the Prometheus Remote Write spec.

Additionally fixes a bug where samplesLost incorrectly counted
unfiltered samples (including NaN) instead of the actual samples
that were attempted.

Validated via A/B E2E testing against Thanos Receive:
  - Baseline (async): 14 out-of-order / 5,262 appended (0.27% loss)
  - Fix (sync):        0 out-of-order / 5,264 appended (0.00% loss)
  - Throughput: identical (~5,260 samples over equal soak periods)
  - All 45 smoke tests passing on both runs

Assisted-By: Claude Opus 4.6 <noreply@anthropic.com>
@cnewkirk cnewkirk force-pushed the bugfix/improve-prom-writespec-compliance branch from ea8d098 to b6e6b75 Compare April 27, 2026 16:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants